## [1] "Dataset variables"
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ rating : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
## [1] "Dataset structure"
| X | fixed.acidity | volatile.acidity | citric.acid |
|---|---|---|---|
| Min. : 1.0 | Min. : 4.60 | Min. :0.1200 | Min. :0.000 |
| 1st Qu.: 400.5 | 1st Qu.: 7.10 | 1st Qu.:0.3900 | 1st Qu.:0.090 |
| Median : 800.0 | Median : 7.90 | Median :0.5200 | Median :0.260 |
| Mean : 800.0 | Mean : 8.32 | Mean :0.5278 | Mean :0.271 |
| 3rd Qu.:1199.5 | 3rd Qu.: 9.20 | 3rd Qu.:0.6400 | 3rd Qu.:0.420 |
| Max. :1599.0 | Max. :15.90 | Max. :1.5800 | Max. :1.000 |
| residual.sugar | chlorides | free.sulfur.dioxide |
|---|---|---|
| Min. : 0.900 | Min. :0.01200 | Min. : 1.00 |
| 1st Qu.: 1.900 | 1st Qu.:0.07000 | 1st Qu.: 7.00 |
| Median : 2.200 | Median :0.07900 | Median :14.00 |
| Mean : 2.539 | Mean :0.08747 | Mean :15.87 |
| 3rd Qu.: 2.600 | 3rd Qu.:0.09000 | 3rd Qu.:21.00 |
| Max. :15.500 | Max. :0.61100 | Max. :72.00 |
| total.sulfur.dioxide | density | pH | sulphates |
|---|---|---|---|
| Min. : 6.00 | Min. :0.9901 | Min. :2.740 | Min. :0.3300 |
| 1st Qu.: 22.00 | 1st Qu.:0.9956 | 1st Qu.:3.210 | 1st Qu.:0.5500 |
| Median : 38.00 | Median :0.9968 | Median :3.310 | Median :0.6200 |
| Mean : 46.47 | Mean :0.9967 | Mean :3.311 | Mean :0.6581 |
| 3rd Qu.: 62.00 | 3rd Qu.:0.9978 | 3rd Qu.:3.400 | 3rd Qu.:0.7300 |
| Max. :289.00 | Max. :1.0037 | Max. :4.010 | Max. :2.0000 |
| alcohol | quality | rating |
|---|---|---|
| Min. : 8.40 | 3: 10 | bad : 63 |
| 1st Qu.: 9.50 | 4: 53 | average:1319 |
| Median :10.20 | 5:681 | good : 217 |
| Mean :10.42 | 6:638 | NA |
| 3rd Qu.:11.10 | 7:199 | NA |
| Max. :14.90 | 8: 18 | NA |
First I’m going to explore each individual distribution to get a feel for the data. This will also help me choose the kind of assumptions I can make when applying statistical tests.
The high concentration of wines in the center of the quality scale and the scarcity of wines at the extremes might be a problem for building a predictive model later on.
There is a high concentration of wines with fixed.acidity close to 8 (the median is 7.9), but some outliers in the upper range pull the mean up to 8.32.
The distribution of volatile.acidity appears bimodal, with peaks at 0.4 and 0.6 and some outliers in the higher ranges.
Now this is a strange distribution. About 8% of wines contain no citric acid at all. Maybe a problem in the data collection process?
residual.sugar shows a high concentration of wines around 2.2 (the median), with some outliers along the higher ranges.
We see a similar distribution for chlorides.
The distribution of free.sulfur.dioxide peaks at around 7 and from then on resembles a long-tailed distribution, with very few wines over 60.
As expected, the distribution of total.sulfur.dioxide closely resembles the previous one.
The distribution of density looks very close to normal.
pH also looks normally distributed.
For sulphates we see a distribution similar to those of residual.sugar and chlorides.
For alcohol we see the same rapid increase followed by a long tail that we saw for sulfur dioxide. I wonder if there is a correlation between the variables.
There are 1599 observations of wines in the dataset, with 12 features. One variable (quality) is categorical; the others are numerical and describe physical and chemical properties of the wine.
Other observations: The median quality is 6, which on the given scale (1-10) is a mediocre wine. The best wine in the sample has a score of 8, and the worst has a score of 3. The dataset is not balanced: there are far more average wines than poor or excellent ones, which might prove challenging when designing a predictive algorithm.
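The imbalance can be quantified directly from the counts printed in the summary table. The sketch below only reuses those six counts; the variable names are mine and purely illustrative.

```r
# Quality counts copied from the summary table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines at each quality score, in percent.
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
print(quality_share)

# Rebuild the individual scores to recover the median.
scores <- rep(as.integer(names(quality_counts)), times = quality_counts)
median_quality <- median(scores)

# Share of wines scored 5 or 6.
average_share <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
print(round(average_share, 3))
```

Roughly 82% of the wines sit at a score of 5 or 6, so a model that always predicts a middle score would already look deceptively good.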
The main feature in the data is quality. I’d like to determine which features determine the quality of wines.
The variables related to acidity (fixed.acidity, volatile.acidity, citric.acid and pH) might explain some of the variance. I suspect the different acid concentrations might alter the taste of the wine. Also, residual.sugar dictates how sweet a wine is and might also have an influence on taste.
I created a rating variable to improve the later visualizations.
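A minimal sketch of how such a rating variable could be derived with `cut()`. The cut points are an assumption inferred from the summary counts (63 bad = scores 3 and 4, 1319 average = scores 5 and 6, 217 good = scores 7 and 8), and the `quality` vector here is a small hypothetical stand-in for the real column.

```r
# Hypothetical stand-in for the quality column; the real data has 1599 rows.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Assumed cut points: 3-4 bad, 5-6 average, 7-8 good.
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

print(table(rating))
```

Using an ordered factor keeps the groups in a meaningful order in plots and tables.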
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Citric.acid stood out from the other distributions. Apart from some outliers, it had a roughly rectangular distribution, which seems very unexpected given the distribution of wine quality.
A correlation table for all variables will help understand the relationships between them.
| | fixed.acidity | volatile.acidity |
|---|---|---|
| fixed.acidity | 1 | -0.2561 |
| volatile.acidity | -0.2561 | 1 |
| citric.acid | 0.6717 | -0.5525 |
| residual.sugar | 0.1148 | 0.001918 |
| chlorides | 0.09371 | 0.0613 |
| free.sulfur.dioxide | -0.1538 | -0.0105 |
| total.sulfur.dioxide | -0.1132 | 0.07647 |
| density | 0.668 | 0.02203 |
| pH | -0.683 | 0.2349 |
| sulphates | 0.183 | -0.261 |
| alcohol | -0.06167 | -0.2023 |
| quality | 0.1241 | -0.3906 |
| | citric.acid | residual.sugar |
|---|---|---|
| fixed.acidity | 0.6717 | 0.1148 |
| volatile.acidity | -0.5525 | 0.001918 |
| citric.acid | 1 | 0.1436 |
| residual.sugar | 0.1436 | 1 |
| chlorides | 0.2038 | 0.05561 |
| free.sulfur.dioxide | -0.06098 | 0.187 |
| total.sulfur.dioxide | 0.03553 | 0.203 |
| density | 0.3649 | 0.3553 |
| pH | -0.5419 | -0.08565 |
| sulphates | 0.3128 | 0.005527 |
| alcohol | 0.1099 | 0.04208 |
| quality | 0.2264 | 0.01373 |
| | chlorides | free.sulfur.dioxide |
|---|---|---|
| fixed.acidity | 0.09371 | -0.1538 |
| volatile.acidity | 0.0613 | -0.0105 |
| citric.acid | 0.2038 | -0.06098 |
| residual.sugar | 0.05561 | 0.187 |
| chlorides | 1 | 0.005562 |
| free.sulfur.dioxide | 0.005562 | 1 |
| total.sulfur.dioxide | 0.0474 | 0.6677 |
| density | 0.2006 | -0.02195 |
| pH | -0.265 | 0.07038 |
| sulphates | 0.3713 | 0.05166 |
| alcohol | -0.2211 | -0.06941 |
| quality | -0.1289 | -0.05066 |
| | total.sulfur.dioxide | density |
|---|---|---|
| fixed.acidity | -0.1132 | 0.668 |
| volatile.acidity | 0.07647 | 0.02203 |
| citric.acid | 0.03553 | 0.3649 |
| residual.sugar | 0.203 | 0.3553 |
| chlorides | 0.0474 | 0.2006 |
| free.sulfur.dioxide | 0.6677 | -0.02195 |
| total.sulfur.dioxide | 1 | 0.07127 |
| density | 0.07127 | 1 |
| pH | -0.06649 | -0.3417 |
| sulphates | 0.04295 | 0.1485 |
| alcohol | -0.2057 | -0.4962 |
| quality | -0.1851 | -0.1749 |
| | pH | sulphates | alcohol |
|---|---|---|---|
| fixed.acidity | -0.683 | 0.183 | -0.06167 |
| volatile.acidity | 0.2349 | -0.261 | -0.2023 |
| citric.acid | -0.5419 | 0.3128 | 0.1099 |
| residual.sugar | -0.08565 | 0.005527 | 0.04208 |
| chlorides | -0.265 | 0.3713 | -0.2211 |
| free.sulfur.dioxide | 0.07038 | 0.05166 | -0.06941 |
| total.sulfur.dioxide | -0.06649 | 0.04295 | -0.2057 |
| density | -0.3417 | 0.1485 | -0.4962 |
| pH | 1 | -0.1966 | 0.2056 |
| sulphates | -0.1966 | 1 | 0.09359 |
| alcohol | 0.2056 | 0.09359 | 1 |
| quality | -0.05773 | 0.2514 | 0.4762 |
| | quality |
|---|---|
| fixed.acidity | 0.1241 |
| volatile.acidity | -0.3906 |
| citric.acid | 0.2264 |
| residual.sugar | 0.01373 |
| chlorides | -0.1289 |
| free.sulfur.dioxide | -0.05066 |
| total.sulfur.dioxide | -0.1851 |
| density | -0.1749 |
| pH | -0.05773 |
| sulphates | 0.2514 |
| alcohol | 0.4762 |
| quality | 1 |
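A table like the one above can be produced with `cor()` on the numeric columns. The sketch below runs on a small synthetic data frame (column names mirror the dataset, values are made up), so its numbers will not match the real table. Note that `cor()` needs numeric input, so an ordered factor such as quality would have to be converted first.

```r
set.seed(42)
n <- 200

# Synthetic stand-in columns; names mirror the dataset, values are made up.
alcohol <- rnorm(n, mean = 10.4, sd = 1)
density <- 1.0 - 0.001 * alcohol + rnorm(n, sd = 0.001)
pH      <- rnorm(n, mean = 3.3, sd = 0.15)
toy     <- data.frame(alcohol, density, pH)

# Pearson correlation for every pair of columns, rounded for display.
cor_table <- round(cor(toy, method = "pearson"), 4)
print(cor_table)
```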
Alcohol has negative correlation with density. This is expected as alcohol is less dense than water.
Volatile.acidity has a positive correlation with pH. This is unexpected: pH is a direct measure of acidity, and a higher acid concentration should lower it. Maybe the effect of a lurking variable?
Residual.sugar does not show correlation with quality. Free.sulfur.dioxide and total.sulfur.dioxide are highly correlated as expected.
Density has a fairly strong correlation with fixed.acidity (0.668). The variables with the strongest correlations to quality are alcohol (0.476) and volatile.acidity (-0.391).
Let’s use boxplots to further examine the relationship between some variables and quality.
| quality | mean | median |
|---|---|---|
| 3 | 8.36 | 7.5 |
| 4 | 7.779 | 7.5 |
| 5 | 8.167 | 7.8 |
| 6 | 8.347 | 7.9 |
| 7 | 8.872 | 8.8 |
| 8 | 8.567 | 8.25 |
As the correlation table showed, fixed.acidity seems to have little to no effect on quality.
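The per-quality mean and median tables in this section can be built with `tapply()`. This is a sketch on a few hypothetical rows, not the code used for the report.

```r
# Hypothetical toy rows; the real tables are computed on all 1599 wines.
toy <- data.frame(
  quality       = c(5, 5, 5, 6, 6, 7, 7, 7),
  fixed.acidity = c(7.4, 7.8, 8.2, 7.9, 8.1, 8.8, 9.0, 8.6)
)

# One row per quality score with the mean and median of the variable.
by_quality <- data.frame(
  quality = sort(unique(toy$quality)),
  mean    = as.numeric(tapply(toy$fixed.acidity, toy$quality, mean)),
  median  = as.numeric(tapply(toy$fixed.acidity, toy$quality, median))
)
print(by_quality)
```

Comparing mean and median per group is a quick way to see whether outliers are pulling a group’s average.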
| quality | mean | median |
|---|---|---|
| 3 | 0.8845 | 0.845 |
| 4 | 0.694 | 0.67 |
| 5 | 0.577 | 0.58 |
| 6 | 0.4975 | 0.49 |
| 7 | 0.4039 | 0.37 |
| 8 | 0.4233 | 0.37 |
volatile.acidity seems to be an unwanted feature in wines. Quality seems to go up when volatile.acidity goes down. The higher ranges seem to produce more average and poor wines.
| quality | mean | median |
|---|---|---|
| 3 | 0.171 | 0.035 |
| 4 | 0.1742 | 0.09 |
| 5 | 0.2437 | 0.23 |
| 6 | 0.2738 | 0.26 |
| 7 | 0.3752 | 0.4 |
| 8 | 0.3911 | 0.42 |
We can see the weak positive correlation between these two variables. Better wines tend to have a higher concentration of citric acid.
| quality | mean | median |
|---|---|---|
| 3 | 2.635 | 2.1 |
| 4 | 2.694 | 2.1 |
| 5 | 2.529 | 2.2 |
| 6 | 2.477 | 2.2 |
| 7 | 2.721 | 2.3 |
| 8 | 2.578 | 2.1 |
Contrary to what I initially expected, residual.sugar seems to have little to no effect on perceived quality.
| quality | mean | median |
|---|---|---|
| 3 | 0.1225 | 0.0905 |
| 4 | 0.09068 | 0.08 |
| 5 | 0.09274 | 0.081 |
| 6 | 0.08496 | 0.078 |
| 7 | 0.07659 | 0.073 |
| 8 | 0.06844 | 0.0705 |
Although the correlation is weak, a lower concentration of chlorides seems to produce better wines.
| quality | mean | median |
|---|---|---|
| 3 | 11 | 6 |
| 4 | 12.26 | 11 |
| 5 | 16.98 | 15 |
| 6 | 15.71 | 14 |
| 7 | 14.05 | 11 |
| 8 | 13.28 | 7.5 |
The ranges are really close to each other, but it seems that with too little sulfur dioxide we get a poor wine, and with too much we get an average wine.
| quality | mean | median |
|---|---|---|
| 3 | 24.9 | 15 |
| 4 | 36.25 | 26 |
| 5 | 56.51 | 47 |
| 6 | 40.87 | 35 |
| 7 | 35.02 | 27 |
| 8 | 33.44 | 21.5 |
Since total.sulfur.dioxide is a superset of free.sulfur.dioxide, it is no surprise to find a very similar distribution here.
| quality | mean | median |
|---|---|---|
| 3 | 0.9975 | 0.9976 |
| 4 | 0.9965 | 0.9965 |
| 5 | 0.9971 | 0.997 |
| 6 | 0.9966 | 0.9966 |
| 7 | 0.9961 | 0.9958 |
| 8 | 0.9952 | 0.9949 |
Better wines tend to have lower densities, but this is probably due to the alcohol concentration. I wonder if density still has an effect if we hold alcohol constant.
| quality | mean | median |
|---|---|---|
| 3 | 3.398 | 3.39 |
| 4 | 3.382 | 3.37 |
| 5 | 3.305 | 3.3 |
| 6 | 3.318 | 3.32 |
| 7 | 3.291 | 3.28 |
| 8 | 3.267 | 3.23 |
Although there is definitely a trend (better wines being more acidic), there are some outliers. I wonder how the distribution of the different acids affects this.
Let’s examine how each acid concentration affects pH.
It is really strange that an acid concentration would have a positive correlation with pH. Maybe Simpson’s paradox?
When we cluster the data and recalculate the regression coefficients, the sign changes. This indicates that there is in fact a lurking variable distorting the overall coefficient, which is the signature of Simpson’s paradox.
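The sign flip is easy to reproduce on made-up data. The sketch below does not use the clustering approach from the analysis; it assumes the grouping is already known and uses plain `lm()` on synthetic values, purely to show how a pooled slope and the within-group slopes can disagree.

```r
set.seed(1)
n <- 100

# Two made-up clusters: within each one y falls as x rises,
# but the cluster with larger x also sits at a higher y level.
group  <- rep(c("a", "b"), each = n)
x      <- c(rnorm(n, mean = 2, sd = 0.5), rnorm(n, mean = 6, sd = 0.5))
offset <- ifelse(group == "a", 0, 8)
y      <- offset - 1 * x + rnorm(2 * n, sd = 0.3)

# Slope when all points are pooled together.
pooled_slope <- coef(lm(y ~ x))[["x"]]

# Slope within each cluster.
within_slopes <- sapply(split(data.frame(x, y), group),
                        function(d) coef(lm(y ~ x, data = d))[["x"]])

print(pooled_slope)
print(within_slopes)
```

The pooled slope comes out positive while both within-cluster slopes are negative, which is the same pattern described above for volatile.acidity and pH.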
Because pH measures acid concentration on a log scale, it is no surprise to find stronger correlations between pH and the log of the acid concentrations. We can investigate how much of the variance in pH these three acidity variables can explain using a linear model.
##
## Call:
## lm(formula = pH ~ I(log10(citric.acid)) + I(log10(volatile.acidity)) +
## I(log10(fixed.acidity)), data = subset(wine, citric.acid >
## 0))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47184 -0.06318 -0.00003 0.06447 0.32265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.230862 0.040578 104.266 < 2e-16 ***
## I(log10(citric.acid)) -0.052187 0.008797 -5.933 3.72e-09 ***
## I(log10(volatile.acidity)) -0.049788 0.021248 -2.343 0.0193 *
## I(log10(fixed.acidity)) -1.071983 0.038987 -27.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1068 on 1463 degrees of freedom
## Multiple R-squared: 0.4876, Adjusted R-squared: 0.4866
## F-statistic: 464.1 on 3 and 1463 DF, p-value: < 2.2e-16
It seems the three acidity variables can explain only about half (49%) of the variance in pH. The mean error is especially bad for poor and for excellent wines. This leads me to believe that there are other components that affect acidity.
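The `subset(wine, citric.acid > 0)` in the model call matters because `log10(0)` is `-Inf`, which `lm()` cannot handle. A minimal sketch of the same model shape on synthetic data (all values made up, with 20 rows forced to zero citric acid) shows the filtering step:

```r
set.seed(7)
n <- 300

# Synthetic acid concentrations (made up, strictly positive at first).
fixed.acidity    <- runif(n, 5, 15)
volatile.acidity <- runif(n, 0.1, 1.2)
citric.acid      <- runif(n, 0, 0.8)
citric.acid[1:20] <- 0  # mimic the wines with no citric acid
pH  <- 4.2 - 1.0 * log10(fixed.acidity) + rnorm(n, sd = 0.1)
toy <- data.frame(fixed.acidity, volatile.acidity, citric.acid, pH)

# Drop rows with citric.acid == 0 before taking logs.
fit <- lm(pH ~ log10(citric.acid) + log10(volatile.acidity) + log10(fixed.acidity),
          data = subset(toy, citric.acid > 0))

r_squared <- summary(fit)$r.squared
print(nobs(fit))
print(round(r_squared, 3))
```

In the real model, 1463 residual degrees of freedom plus 4 coefficients give 1467 rows, so 132 of the 1599 wines (about 8%) were dropped by this filter.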
Interesting. Although there are many outliers among the average wines, better wines seem to have a higher concentration of sulphates.
The correlation is clear here. As alcohol content increases, we see a higher concentration of better-graded wines. Given the high number of outliers, it seems we cannot rely on alcohol alone to produce better wines. Let’s try using a simple linear model to investigate.
##
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.12503 0.17471 -0.716 0.474
## alcohol 0.36084 0.01668 21.639 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
Based on the R-squared value, alcohol alone explains only about 23% of the variance in quality. We’re going to need to look at the other variables to generate a better model.
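Two details of the model above can be checked with a few lines. First, `as.numeric()` on an ordered factor returns the level codes (1 to 6 for this dataset), not the labels (3 to 8); this shifts the intercept by a constant but leaves the slope and R-squared unchanged, and it would explain the near-zero intercept. Second, for a model with a single predictor, R-squared is the squared correlation. The factor below is a small illustrative example, not the real column.

```r
# as.numeric() on a factor returns level codes, not the printed labels.
quality <- factor(c(3, 5, 6, 8), levels = 3:8, ordered = TRUE)
codes   <- as.numeric(quality)                # 1 3 4 6
scores  <- as.numeric(as.character(quality))  # 3 5 6 8
print(scores - codes)

# For a one-predictor model, R-squared equals the squared correlation.
r <- 0.4762  # alcohol vs quality, from the correlation table
r_squared <- r^2
print(round(r_squared, 4))
```

The squared correlation agrees with the reported Multiple R-squared of 0.2267 up to rounding of the correlation.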
How did the feature(s) of interest vary with other features in the dataset?
Fixed.acidity seems to have little to no effect on quality.
Quality seems to go up when volatile.acidity goes down. The higher ranges seem to produce more average and poor wines.
Better wines tend to have a higher concentration of citric acid.
Contrary to what I initially expected, residual.sugar seems to have little to no effect on perceived quality.
Although the correlation is weak, a lower concentration of chlorides seems to produce better wines.
Better wines tend to have lower densities.
In terms of pH, it seems better wines are more acidic, but there were many outliers. Better wines also seem to have a higher concentration of sulphates.
Alcohol content has the strongest correlation with quality, but as the linear model showed, it cannot explain all the variance alone. We’re going to need to look at the other variables to generate a better model.
I verified the strong relation between free and total sulfur.dioxide.
I also checked the relation between the acid concentration and pH. Of those, only volatile.acidity surprised me with a positive coefficient for the linear model.
The relationship I verified most clearly was the one between total.sulfur.dioxide and free.sulfur.dioxide (correlation 0.668).
Let’s try using multivariate plots to answer some questions that arose earlier and to look for other relationships in the data.
When we hold alcohol constant, there is no evidence that density affects quality which confirms our earlier suspicion.
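A numeric counterpart to this plot is a partial correlation: regress both density and quality on alcohol and correlate the residuals. This is a different technique from the plot used here, shown on synthetic data in which quality depends on alcohol only (all coefficients are made up).

```r
set.seed(3)
n <- 600

# Synthetic data: density and quality both depend on alcohol,
# but quality does not depend on density directly.
alcohol <- runif(n, 8.5, 14)
density <- 1.005 - 0.0008 * alcohol + rnorm(n, sd = 0.0015)
quality <- 2 + 0.36 * alcohol + rnorm(n, sd = 0.7)

# Overall correlation, confounded by alcohol.
overall <- cor(density, quality)

# Partial correlation: correlate what is left after removing alcohol.
partial <- cor(resid(lm(density ~ alcohol)), resid(lm(quality ~ alcohol)))

print(round(c(overall = overall, partial = partial), 3))
```

In this toy setup the overall correlation is clearly negative while the partial correlation is close to zero, which is the pattern one would expect if alcohol fully accounts for the density effect.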
Interesting! It seems that for wines with high alcohol content, having a higher concentration of sulphates produces better wines.
The reverse seems to be true for volatile acidity. Less acetic acid combined with a higher alcohol concentration seems to produce better wines.
Low pH and high alcohol concentration seem to be a good match.
Using multivariate plots we should be able to investigate further the relationship between the acids and quality.
Almost no variance on the y axis compared to the x axis. Let’s try the other acids.
High citric acid and low acetic acid seem like a good combination.
##
## Pearson's product-moment correlation
##
## data: wine$citric.acid and wine$fixed.acidity
## t = 36.2341, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
Although there seems to be a correlation between tartaric acid and citric acid concentrations, nothing stands out in terms of quality.
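The t statistic and confidence interval in the test output follow directly from the sample correlation and the sample size. A short check, using only the two numbers reported above:

```r
r  <- 0.6717034  # sample correlation from the output above
n  <- 1599       # number of wines
df <- n - 2

# t statistic for H0: true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 4))
print(round(ci, 4))
```

With almost 1600 observations the interval is narrow, so the estimate of the correlation is precise even though the relationship itself says little about quality.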
Now I’m going to use the most prominent variables to generate some linear models and compare them.
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity,
## data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid, data = training_data)
## m5: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid + fixed.acidity, data = training_data)
## m6: lm(formula = as.numeric(quality) ~ alcohol + sulphates + pH,
## data = training_data)
##
## =============================================================================
## m1 m2 m3 m4 m5 m6
## -----------------------------------------------------------------------------
## (Intercept) -0.066 -0.604** 0.605* 0.670** 0.294 1.328*
## (0.220) (0.224) (0.248) (0.257) (0.289) (0.516)
## alcohol 0.357*** 0.339*** 0.306*** 0.305*** 0.315*** 0.362***
## (0.021) (0.020) (0.020) (0.020) (0.020) (0.021)
## sulphates 1.099*** 0.745*** 0.770*** 0.780*** 0.980***
## (0.138) (0.137) (0.139) (0.138) (0.139)
## volatile.acidity -1.199*** -1.272*** -1.333***
## (0.125) (0.146) (0.147)
## citric.acid -0.128 -0.436*
## (0.130) (0.170)
## fixed.acidity 0.047**
## (0.017)
## pH -0.631***
## (0.152)
## -----------------------------------------------------------------------------
## R-squared 0.232 0.280 0.343 0.344 0.349 0.293
## adj. R-squared 0.231 0.279 0.341 0.341 0.346 0.291
## sigma 0.704 0.682 0.651 0.651 0.649 0.676
## F 289.048 185.949 166.182 124.873 102.212 131.779
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1022.548 -991.540 -947.687 -947.203 -943.227 -983.004
## Deviance 473.685 444.023 405.216 404.808 401.465 436.188
## AIC 2051.096 1991.080 1905.374 1906.407 1900.454 1976.008
## BIC 2065.693 2010.544 1929.704 1935.602 1934.516 2000.337
## N 959 959 959 959 959 959
## =============================================================================
Notice I did not include pH in the same formula as the acids, to avoid collinearity problems.
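A minimal sketch of this kind of nested-model comparison, on synthetic data. The 60% training share is an assumption inferred from N = 959 out of 1599; the coefficients used to generate the toy data are made up, and only three of the six models are reproduced.

```r
set.seed(11)
n <- 500

# Synthetic predictors and response; values are illustrative only.
toy <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  sulphates        = runif(n, 0.3, 1.2),
  volatile.acidity = runif(n, 0.1, 1.2)
)
toy$quality <- 1 + 0.3 * toy$alcohol + 0.8 * toy$sulphates -
  1.2 * toy$volatile.acidity + rnorm(n, sd = 0.65)

# Assumed 60/40 split into training and held-out rows.
train_idx     <- sample(n, size = round(0.6 * n))
training_data <- toy[train_idx, ]
test_data     <- toy[-train_idx, ]

# Nested models, each adding one predictor.
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- update(m1, . ~ . + sulphates)
m3 <- update(m2, . ~ . + volatile.acidity)

models <- list(m1, m2, m3)
comparison <- data.frame(
  model  = c("m1", "m2", "m3"),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC)
)
print(comparison)
```

Plain R-squared can never decrease when a predictor is added, so adjusted R-squared, AIC and BIC are the rows of the table worth comparing across models.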
High alcohol contents and high sulphate concentrations combined seem to produce better wines.
Yes, I created several models. The most prominent of them was composed of the variables alcohol, sulphates, and the acid variables. There are two problems with it. First, the low R-squared score suggests that there is missing information needed to properly predict quality. Second, both the residuals plot and the cross-validation favor average wines. This is probably a reflection of the high number of average wines in the training dataset, or it could mean that there is missing information that would help predict the edge cases. I hope that the next course in the nanodegree will help me generate better models :) .
This is a very strange distribution. It does not match what we would expect from a variable collected in an experimental situation.
High alcohol contents and high sulphate concentrations combined seem to produce better wines.
The linear model with the highest R-squared value could only explain around 35% of the variance in quality. Also, the clear correlation shown by the residual plot earlier seems to reinforce that there is missing information to better predict both poor and excellent wines.
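Assuming the residual plot shows the error against the observed quality, part of that correlation is mechanical: for a least-squares fit with an intercept, the correlation between the residuals and the observed response equals sqrt(1 - R-squared). The sketch below checks this identity on synthetic data; it is a side note, not part of the original analysis.

```r
set.seed(5)
n <- 400

# Synthetic predictor and response with a modest linear relationship.
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)

fit       <- lm(y ~ x)
r2        <- summary(fit)$r.squared
resid_cor <- cor(resid(fit), y)

# The two printed values should agree.
print(c(resid_cor = resid_cor, expected = sqrt(1 - r2)))

# Value implied by the best model in the table above (R-squared = 0.349).
print(round(sqrt(1 - 0.349), 2))
```

With an R-squared of about 0.35 this correlation is about 0.81, so low scores are overpredicted and high scores underpredicted, which is consistent with a model that favors average wines.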
The wine data set contains information on the chemical properties of a selection of wines collected in 2009. It also includes sensory data (wine ranking).
I started by looking at the individual distributions of the variables, trying to get a feel for each one.
The first thing I noticed was the high concentration of wines in the middle ranges of the ranking, that is, average tasting wines. This proved to be very problematic during the analysis, as I kept asking myself whether there was a true correlation between two variables or just a coincidence given the lack of “outlier” (poor and excellent) wines.
Out of the chemical variables, the only one that stood out was the concentration of citric acid (variable name citric.acid). The first thing I noticed was the high number of wines that had no citric.acid at all. My initial thought was a data collection error, but upon researching the subject, I found out that citric acid is sometimes added to wines to boost overall acidity, so it makes sense that some wines would have none. Nonetheless, this variable also showed a strange distribution, with some peaks but an almost rectangular shape, especially in the 0-0.5 range.
All of the other variables showed either a normal or a long-tailed distribution.
After exploring the individual variables, I proceeded to investigate the relationships between each input variable and the outcome variable quality.
The most promising variables were alcohol concentration, sulphates and the individual acid concentrations.
I also tried investigating the effect of each acid in the overall pH for the wine. I used scatterplots to explore the relationships graphically and also generated a linear model to check how much of pH the three variables accounted for.
The first surprise here was finding that the correlation between acetic acid concentration and pH was positive. I immediately suspected this was the result of some lurking variable (Simpson’s paradox), and with the help of the “Simpsons” package I confirmed that suspicion.
The second finding was that the concentrations of the three acids account for less than half of the variance in pH. I interpreted this as a sign that there are more components affecting acidity that were not measured.
In the final part of the analysis I tried using multivariate plots to investigate if there were interesting combinations of variables that might affect quality. I also used a multivariate plot to confirm that density did not have an effect on quality when holding alcohol concentration constant.
In the end, the produced model could not explain much of the variance in quality. This is further corroborated by the acidity analysis.
For future studies, it would be interesting to measure more acid types. Wikipedia, for example, suggests that malic and lactic acid are important in wine taste, and these were not included in this sample.
Also, I think it would be interesting to include each wine critic’s judgement as a separate entry in the dataset. After all, each individual has a different taste and is subject to prejudice and other distorting factors. I believe that having this extra information would add more value to the analysis.